Quick analysis and similarity network building¶
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd, numpy as np
import seaborn as sns
sns.set(style="whitegrid")
import sqlite3
Data¶
The data used here were downloaded from the Million Song Dataset:
- SQLite database containing most metadata about each track (NEW VERSION 03/27/2011).
- SQLite database linking artist ID to the tags (Echo Nest and musicbrainz ones).
Pandas can read data directly from the SQLite databases.
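As a minimal sketch of this pattern (with assumed toy data, since the actual Million Song Dataset files are not bundled here), we can build an in-memory SQLite database and read a table straight into a DataFrame with pd.read_sql:

```python
import sqlite3
import pandas as pd

# Toy in-memory database mimicking the artist_mbtag table's schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist_mbtag (artist_id TEXT, mbtag TEXT)")
conn.executemany("INSERT INTO artist_mbtag VALUES (?, ?)",
                 [("AR1", "rock"), ("AR1", "uk"), ("AR2", "jazz")])
conn.commit()

# pd.read_sql runs the query on the connection and returns a DataFrame.
df = pd.read_sql("SELECT * FROM artist_mbtag", conn)
print(df)
```

The same call works unchanged on a file-backed connection such as the ones below.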
Connect to both db¶
conn_terms = sqlite3.connect("data/artist_term.db")
conn_artists = sqlite3.connect("data/track_metadata.db")
Get the mbtags¶
These are human-selected tags from musicbrainz.org. They cover a large variety of categories: countries, genres, languages.
mbtags = pd.read_sql("SELECT * FROM artist_mbtag",conn_terms)
cnt_art_tag = pd.pivot_table(mbtags,values="mbtag",index="artist_id",\
    aggfunc="count").reset_index().rename(columns={"mbtag":"countTags"})
print("{:,}".format(len(mbtags["artist_id"].unique())), "unique artists, and", \
      "{:,}".format(len(mbtags["mbtag"].unique())), "unique tags;", \
      "with an average of {0:0.2f}".format(cnt_art_tag["countTags"].mean()), "tags per artist")
cnt_art_tag.hist(bins=21)
The distribution of tag counts shows that most artists have only a few tags. It will be hard to build a similarity network from so little information!
Build a network¶
Source, target and edges¶
The data we use is a list of tuples (artist_id, mbtag) for 8,838 unique artists, with 1 to 20 tags each (average: 2.8).
We want to create an undirected network of artists who share common tags:
- Sources and targets are the artists (we copy the same dataframe and rename the id columns).
- Edges are created by merging the source on the target, using the tags as the common field. We then remove the edges where the source and target are the same.
source = mbtags.copy().rename(columns={"artist_id":"Source"})
target = mbtags.copy().rename(columns={"artist_id":"Target"})
edges = source.merge(target,on="mbtag",how="inner")
edges = edges[edges["Source"] != edges["Target"]]
print("There are", "{:,}".format(len(source)), \
"sources/targets connected by","{:,}".format(len(edges)), "links")
mbtags["nb_tags"] = mbtags.groupby(["artist_id"])["mbtag"].transform("count")
len(mbtags[mbtags["nb_tags"]>4]), len(mbtags[mbtags["nb_tags"]>2]), len(mbtags)
Remove redundant edges¶
This network has 8 million edges, many of them redundant, so we'll add a weight to each edge and drop the duplicates (keeping only one link per pair).
edges["weight"] = edges.groupby(["Source","Target"])["mbtag"].transform("count")
edges = edges[["Source","Target","weight"]].drop_duplicates()
len(edges)
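The weighting step can be checked on assumed toy data: transform("count") writes the group size onto every row of a (Source, Target) pair, so after dropping duplicates each pair keeps one row whose weight is the number of tags it shared.

```python
import pandas as pd

# Toy edge list (assumed data): A-B appears twice (two shared tags), A-C once.
edges_toy = pd.DataFrame({
    "Source": ["A", "A", "A"],
    "Target": ["B", "B", "C"],
    "mbtag":  ["rock", "uk", "rock"],
})

# Count the shared tags per (Source, Target) pair, then keep one row per pair.
edges_toy["weight"] = edges_toy.groupby(["Source", "Target"])["mbtag"].transform("count")
edges_toy = edges_toy[["Source", "Target", "weight"]].drop_duplicates()
print(edges_toy)
```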
tracks = pd.read_sql("SELECT artist_id,artist_name FROM songs",conn_artists)
artists = tracks.drop_duplicates()
print("There were", "{:,}".format(len(tracks)),"tracks, from", "{:,}".format(len(artists)), "artists")
If we look at the unique names of artists and ids, we realise that there is a discrepancy:
print(len(artists.artist_id.unique()), "unique artists")
print(len(artists.artist_name.unique()), "unique artist names")
Argh, it seems that some artist_id entries have several names!
e.g. below for AROQS6O1187FB3D146
artists[artists["artist_id"] == "AROQS6O1187FB3D146" ]
Looking at them, some are mistakes, but most are simply variations of the name, so we'll drop the duplicate artist_id rows and keep only the first corresponding artist_name.
artists = artists.drop_duplicates("artist_id")
len(artists),len(artists.artist_id.unique()),len(artists.artist_name.unique())
Add Tags list¶
The tags are currently in one column, with one value per row. We want them in one row per artist, so we use a clever join with the pandas groupby.
mbtags["Tags"] = mbtags.groupby(['artist_id'])['mbtag'].transform(lambda x: ','.join(x))
artists = artists.merge(mbtags.drop_duplicates("artist_id"),how="left")
len(artists),artists.columns
artists[artists["artist_id"] == "AR00GVV11C8A415A54"]
Add source names¶
edges = edges.merge(artists.rename(columns={"artist_id":"Source"}),\
how="left").rename(columns={"Source":"Source_id","artist_name":"Source"})
len(edges)
edges.head()
Add target names¶
edges = edges.merge(artists,left_on="Target",right_on="artist_id",how="left")
edges = edges.rename(columns={"Target":"Target_id","artist_name":"Target"})[["Source","Target","Source_id",\
"Target_id","weight"]]
len(edges),edges.columns,len(edges[edges["weight"] > 3])
We will keep only the most connected edges (weight greater than 3), and export them to csv to use in Gephi.
edges[edges["weight"] > 3][["Source","Target","weight","Source_id","Target_id"]].to_csv("edges.csv",index=None)
Get the nodes from the edges¶
nodes = pd.DataFrame(edges[edges["weight"] > 3]["Source_id"].tolist() + \
edges[edges["weight"] > 3]["Target_id"].tolist(),\
columns=["artist_id"]).drop_duplicates()
nodes = nodes.merge(mbtags[["artist_id","Tags"]],how="left").drop_duplicates()
len(nodes),nodes.columns
nodes = nodes.merge(artists,how="left")[["artist_name","Tags"]].rename(columns={"artist_name":"id"}).drop_duplicates()
We export to csv to use them in Gephi.
nodes.to_csv("nodes.csv",index=None)